Identifying Fake News using Natural Language Processing

Author

Robert Tran

Published

March 10, 2023

The immense amount of information that surrounds us today can make it difficult to distinguish truth from fiction. To help combat this problem, we can turn to Natural Language Processing (NLP). In this post, we will explore NLP techniques that can be used to classify fake news and improve the way we digest the information we see.

We will begin our exploration by importing libraries. For NLP we will use nltk, and for our deep learning model we will use tensorflow.

import nltk
from nltk.corpus import stopwords
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
import matplotlib.pyplot as plt
import re
import string

Next, we will retrieve our training data and convert it into a pandas dataframe. The dataframe will have the title and text of various news articles. For each article, it will also have a classification of whether or not the article is fake.

train_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_train.csv?raw=true"

data = pd.read_csv(train_url)
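Before preprocessing, it is worth taking a quick peek at the first few rows to confirm that the dataframe contains the title, text, and fake columns we expect:

# preview the first few rows: title, text, and the fake label
data.head()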

Next, we will define the function make_dataset, which will preprocess the data before we feed it into the NLP model. In this function, we will download a list of stopwords from nltk. Stopwords are commonly used words in the English language, such as “is” or “at,” that carry little information about the meaning of the text. After dropping any rows with missing values for title, text, or fake, we will remove the stopwords from the text and title and store the results in new columns, which will be used in the model. The function will return a Dataset that has title and text as inputs and fake as the output. Finally, we will batch the data to improve processing speed.

def make_dataset(df):
  # get list of stopwords
  nltk.download('stopwords')
  stop = set(stopwords.words('english'))

  # filter out examples with missing data before processing
  df = df.dropna(subset=['title', 'text', 'fake'])

  # remove stopwords from text and title
  df['text_wo_stopwords'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
  df['title_wo_stopwords'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

  return tf.data.Dataset.from_tensor_slices((
    # dictionary for input data
    {
        "title": df[["title_wo_stopwords"]],
        "text": df[["text_wo_stopwords"]]
    },
    # dictionary for output data/labels
    {
        "fake": df[["fake"]]
    }
  ))

tf_data = make_dataset(data)
tf_data = tf_data.batch(100)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
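As a quick sanity check, we can pull a single batch out of the dataset and confirm its structure (an illustrative peek; the field names match the dictionaries built in make_dataset):

# each batch holds (batch_size, 1) string tensors for the two inputs
# and a (batch_size, 1) integer tensor for the fake labels
for inputs, labels in tf_data.take(1):
  print(inputs["title"].shape, inputs["text"].shape, labels["fake"].shape)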

Now that our data has been preprocessed, we can split it into a training and validation set. We will set aside 20% of the data for validation and use the rest for training.

train_size = int(0.8*len(tf_data)) 
val_size = int(0.2*len(tf_data))

train = tf_data.take(train_size) # data[:train_size]
val = tf_data.skip(train_size).take(val_size) # data[train_size : train_size + val_size]
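One caveat: take and skip split the batches in order, so if the source CSV happened to be sorted by label, the validation set would be skewed. We proceed without shuffling here since the data appears well mixed, but a minimal guard would look like the following (run before the split):

# optional: shuffle whole batches before calling take/skip
# a buffer_size equal to the number of batches gives a full shuffle
shuffled = tf_data.shuffle(buffer_size=len(tf_data), reshuffle_each_iteration=False)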

Before we train our model, we need to determine a base rate to compare it against. To calculate this, we will use the relative frequency of fake news articles in the data, which tells us what the accuracy would be if every article were classified as fake (or as real).

fake_count = 0
total_count = 0
for input, output in tf_data:
  fake_count += int(tf.reduce_sum(output["fake"]))
  total_count += len(output["fake"])

print(fake_count / total_count)
0.522963160942581

We can see here that fake news articles make up 52.3% of the data, so they are slightly more common than real ones. A model that labeled every article as fake would therefore be correct 52.3% of the time, which gives us the base rate to beat.
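As a cross-check, the same base rate can be read directly off the dataframe, since the mean of a 0/1 column is the fraction of ones (this assumes fake is encoded as 0/1, as it is here):

# fraction of articles labeled fake
print(data['fake'].mean())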

Title-Only Model

We are now ready to start training models. In our first model, we will look only at the titles of articles and try to classify whether an article is fake based on the title alone.

First, we will define a standardization function that makes all letters lowercase and removes any punctuation from the text.

def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation),'')
    return no_punctuation 
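To see what this does, we can run the function on a made-up headline (purely illustrative):

# lowercase the string and strip its punctuation
print(standardization(tf.constant(["Breaking News: Aliens Land!!"])))
# tf.Tensor([b'breaking news aliens land'], shape=(1,), dtype=string)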

The standardization function will then be passed to a TextVectorization layer. We will then adapt this layer on our training data so that it learns a vocabulary from the titles.

max_tokens = 2000
size_vocabulary = 2000

title_vectorize_layer = layers.TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500) 

title_vectorize_layer.adapt(train.map(lambda x, y: x["title"]))
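Once adapted, the layer maps a raw title to a fixed-length sequence of integer token IDs. A quick illustrative check (the exact IDs depend on the learned vocabulary):

# the first few entries are token IDs for the words in the title;
# the rest of the 500-long output sequence is zero padding
sample_title = tf.constant([["president signs new executive order"]])
print(title_vectorize_layer(sample_title)[0, :8])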

Now, we will use the Functional API to define the model that will process the data. The model will have an input layer, title_vectorize_layer (defined above), an embedding layer, a dropout layer, a global max pooling layer, and an output layer. Once we’ve defined the model, we will compile it.

# Define input layer
inputs = layers.Input(shape=(1,), dtype="string", name="title")

x = title_vectorize_layer(inputs)

# Add embedding layer
x = layers.Embedding(max_tokens, output_dim=3, name="embedding")(x)

# Add dropout layer
x = layers.Dropout(0.2)(x)

# Add global max pooling layer
x = layers.GlobalMaxPooling1D()(x)

# Add output layer
outputs = layers.Dense(2, name="fake")(x)

# Define model with inputs and outputs
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

Below is a visualization of the layers of our model.

from tensorflow.keras import utils
utils.plot_model(model)

Let’s fit our model onto the data and see how it performs after twenty epochs.

history = model.fit(train, epochs=20, validation_data=val)
Epoch 1/20
180/180 [==============================] - 18s 75ms/step - loss: 0.6608 - accuracy: 0.7314 - val_loss: 0.6197 - val_accuracy: 0.8265
Epoch 2/20
180/180 [==============================] - 1s 6ms/step - loss: 0.5492 - accuracy: 0.8611 - val_loss: 0.5056 - val_accuracy: 0.8876
Epoch 3/20
180/180 [==============================] - 1s 6ms/step - loss: 0.4391 - accuracy: 0.8864 - val_loss: 0.4082 - val_accuracy: 0.9047
Epoch 4/20
180/180 [==============================] - 1s 7ms/step - loss: 0.3573 - accuracy: 0.8984 - val_loss: 0.3381 - val_accuracy: 0.9128
Epoch 5/20
180/180 [==============================] - 1s 6ms/step - loss: 0.3041 - accuracy: 0.9095 - val_loss: 0.2897 - val_accuracy: 0.9350
Epoch 6/20
180/180 [==============================] - 2s 9ms/step - loss: 0.2696 - accuracy: 0.9169 - val_loss: 0.2563 - val_accuracy: 0.9386
Epoch 7/20
180/180 [==============================] - 1s 7ms/step - loss: 0.2447 - accuracy: 0.9187 - val_loss: 0.2325 - val_accuracy: 0.9384
Epoch 8/20
180/180 [==============================] - 2s 12ms/step - loss: 0.2247 - accuracy: 0.9221 - val_loss: 0.2150 - val_accuracy: 0.9393
Epoch 9/20
180/180 [==============================] - 2s 10ms/step - loss: 0.2139 - accuracy: 0.9215 - val_loss: 0.2025 - val_accuracy: 0.9402
Epoch 10/20
180/180 [==============================] - 1s 8ms/step - loss: 0.2055 - accuracy: 0.9234 - val_loss: 0.1929 - val_accuracy: 0.9409
Epoch 11/20
180/180 [==============================] - 1s 6ms/step - loss: 0.2001 - accuracy: 0.9236 - val_loss: 0.1859 - val_accuracy: 0.9409
Epoch 12/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1934 - accuracy: 0.9239 - val_loss: 0.1802 - val_accuracy: 0.9422
Epoch 13/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1921 - accuracy: 0.9233 - val_loss: 0.1760 - val_accuracy: 0.9422
Epoch 14/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1865 - accuracy: 0.9263 - val_loss: 0.1725 - val_accuracy: 0.9418
Epoch 15/20
180/180 [==============================] - 1s 7ms/step - loss: 0.1840 - accuracy: 0.9251 - val_loss: 0.1699 - val_accuracy: 0.9431
Epoch 16/20
180/180 [==============================] - 2s 9ms/step - loss: 0.1809 - accuracy: 0.9253 - val_loss: 0.1677 - val_accuracy: 0.9422
Epoch 17/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1792 - accuracy: 0.9267 - val_loss: 0.1658 - val_accuracy: 0.9440
Epoch 18/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1764 - accuracy: 0.9273 - val_loss: 0.1641 - val_accuracy: 0.9443
Epoch 19/20
180/180 [==============================] - 1s 7ms/step - loss: 0.1764 - accuracy: 0.9276 - val_loss: 0.1627 - val_accuracy: 0.9438
Epoch 20/20
180/180 [==============================] - 1s 6ms/step - loss: 0.1753 - accuracy: 0.9282 - val_loss: 0.1615 - val_accuracy: 0.9447
# plot accuracies
epochs = range(1, len(history.history['val_accuracy']) + 1)
plt.plot(epochs,history.history['val_accuracy'], label='Validation accuracy')
plt.plot(epochs,history.history['accuracy'], label='Training accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title("Text-only Model")
plt.legend()
plt.show()

The validation accuracy of the Title-Only model appears to stabilize around 94%. This is significantly higher than the base rate of 52% and the model does not appear to be overfit, as the training accuracy never surpasses the validation accuracy.

Text-Only Model

Let’s try a model that only looks at the text and see if this performs better. We will use a similar layer as the one used for the Title-Only model, except this one will be trained to look at the text column of the dataframe.

max_tokens = 2000
size_vocabulary = 2000

text_vectorize_layer = layers.TextVectorization(
    standardize=standardization,
    max_tokens=size_vocabulary, # only consider this many words
    output_mode='int',
    output_sequence_length=500) 

text_vectorize_layer.adapt(train.map(lambda x, y: x["text"]))

Similar to our previous model, this one will have an input layer, text_vectorize_layer (defined above), an embedding layer, a dropout layer, a global max pooling layer, and an output layer. Once we’ve defined the model, we will compile it. However, the dropout rate will be set to 0.3, since this improves the performance of the model.

# Define input layer
inputs = layers.Input(shape=(1,), dtype="string", name="text")

x = text_vectorize_layer(inputs)

# Add embedding layer
x = layers.Embedding(max_tokens, output_dim=3, name="embedding")(x)

# Add dropout layer
x = layers.Dropout(0.3)(x)

# Add global max pooling layer
x = layers.GlobalMaxPooling1D()(x)

# Add output layer
outputs = layers.Dense(2, name="fake")(x)

# Define model with inputs and outputs
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

Here is a visualization of the Text-Only model layers. It looks very similar to our previous model.

from tensorflow.keras import utils
utils.plot_model(model)

Now, let’s fit the model on the train data and see how well it performs.

history = model.fit(train, epochs=20, validation_data=val)
Epoch 1/20
180/180 [==============================] - 10s 50ms/step - loss: 0.6643 - accuracy: 0.7543 - val_loss: 0.6384 - val_accuracy: 0.8112
Epoch 2/20
180/180 [==============================] - 3s 15ms/step - loss: 0.5576 - accuracy: 0.8213 - val_loss: 0.5448 - val_accuracy: 0.8377
Epoch 3/20
180/180 [==============================] - 2s 11ms/step - loss: 0.4613 - accuracy: 0.8356 - val_loss: 0.4742 - val_accuracy: 0.8431
Epoch 4/20
180/180 [==============================] - 2s 11ms/step - loss: 0.4070 - accuracy: 0.8425 - val_loss: 0.4324 - val_accuracy: 0.8429
Epoch 5/20
180/180 [==============================] - 2s 11ms/step - loss: 0.3792 - accuracy: 0.8468 - val_loss: 0.4073 - val_accuracy: 0.8481
Epoch 6/20
180/180 [==============================] - 2s 11ms/step - loss: 0.3615 - accuracy: 0.8500 - val_loss: 0.3884 - val_accuracy: 0.8591
Epoch 7/20
180/180 [==============================] - 2s 13ms/step - loss: 0.3502 - accuracy: 0.8518 - val_loss: 0.3759 - val_accuracy: 0.8609
Epoch 8/20
180/180 [==============================] - 2s 12ms/step - loss: 0.3393 - accuracy: 0.8577 - val_loss: 0.3662 - val_accuracy: 0.8591
Epoch 9/20
180/180 [==============================] - 2s 11ms/step - loss: 0.3336 - accuracy: 0.8554 - val_loss: 0.3569 - val_accuracy: 0.8597
Epoch 10/20
180/180 [==============================] - 2s 11ms/step - loss: 0.3156 - accuracy: 0.8669 - val_loss: 0.3264 - val_accuracy: 0.8874
Epoch 11/20
180/180 [==============================] - 2s 11ms/step - loss: 0.3001 - accuracy: 0.8748 - val_loss: 0.3164 - val_accuracy: 0.8959
Epoch 12/20
180/180 [==============================] - 2s 12ms/step - loss: 0.2938 - accuracy: 0.8780 - val_loss: 0.3105 - val_accuracy: 0.8957
Epoch 13/20
180/180 [==============================] - 2s 13ms/step - loss: 0.2889 - accuracy: 0.8779 - val_loss: 0.3049 - val_accuracy: 0.8959
Epoch 14/20
180/180 [==============================] - 2s 11ms/step - loss: 0.2862 - accuracy: 0.8821 - val_loss: 0.2989 - val_accuracy: 0.8975
Epoch 15/20
180/180 [==============================] - 2s 10ms/step - loss: 0.2800 - accuracy: 0.8837 - val_loss: 0.2952 - val_accuracy: 0.8984
Epoch 16/20
180/180 [==============================] - 2s 10ms/step - loss: 0.2799 - accuracy: 0.8857 - val_loss: 0.2932 - val_accuracy: 0.8993
Epoch 17/20
180/180 [==============================] - 2s 13ms/step - loss: 0.2835 - accuracy: 0.8801 - val_loss: 0.2918 - val_accuracy: 0.9004
Epoch 18/20
180/180 [==============================] - 2s 12ms/step - loss: 0.2783 - accuracy: 0.8821 - val_loss: 0.2899 - val_accuracy: 0.9000
Epoch 19/20
180/180 [==============================] - 2s 11ms/step - loss: 0.2737 - accuracy: 0.8859 - val_loss: 0.2884 - val_accuracy: 0.8995
Epoch 20/20
180/180 [==============================] - 2s 11ms/step - loss: 0.2764 - accuracy: 0.8833 - val_loss: 0.2873 - val_accuracy: 0.8998
epochs = range(1, len(history.history['val_accuracy']) + 1)
plt.plot(epochs,history.history['val_accuracy'], label='Validation accuracy')
plt.plot(epochs,history.history['accuracy'], label='Training accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title("Text-only Model")
plt.legend()
plt.show()

The validation accuracy of the Text-Only model appears to stabilize around 89%. This is significantly higher than the base rate of 52%, but is not as good as the accuracy of the Title-Only model. Nevertheless, the model does not appear to be overfit, as the training accuracy never significantly surpasses the validation accuracy.

Title and Text Model

Finally, we will try a model that takes both the title and text as inputs. We will reuse the vectorization layers that were defined in the individual models above. However, we will create a single embedding layer that is shared between the two inputs. Additionally, we will concatenate the title and text features and pass them through a dense layer so that the two branches can be combined before the output layer.

# Title layers
title_inputs = layers.Input(shape=(1,), dtype="string", name="title")
title_feat = title_vectorize_layer(title_inputs)

# Add embedding layer
shared_embedding = layers.Embedding(max_tokens, output_dim=3, name="embedding")

title_feat = shared_embedding(title_feat)


# Add dropout layer
title_feat = layers.Dropout(0.2)(title_feat)

# Add global max pooling layer
title_feat = layers.GlobalMaxPooling1D()(title_feat)

# Add dropout layer
title_feat = layers.Dropout(0.1)(title_feat)


# Text layers

text_inputs = layers.Input(shape=(1,), dtype="string", name="text")
text_feat = text_vectorize_layer(text_inputs)
text_feat = shared_embedding(text_feat)

# Add dropout layer
text_feat = layers.Dropout(0.2)(text_feat)

# Add global max pooling layer
text_feat = layers.GlobalMaxPooling1D()(text_feat)

# Add dropout layer
text_feat = layers.Dropout(0.1)(text_feat)

main = layers.concatenate([title_feat, text_feat], axis = 1)
main = layers.Dense(32, activation='relu')(main)

# Add output layer
outputs = layers.Dense(2, name="fake")(main)

# Define model with inputs and outputs
model = tf.keras.Model(inputs=[title_inputs, text_inputs], 
                       outputs=outputs)
model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

Below is a visualization of the layers of our model. There are two inputs which are processed, then combined and processed further to create a prediction of whether or not the article is fake.

from tensorflow.keras import utils
utils.plot_model(model)

Let’s fit the model and see how well it performs.

history = model.fit(train, epochs=20, validation_data=val)
Epoch 1/20
180/180 [==============================] - 11s 55ms/step - loss: 0.6205 - accuracy: 0.6692 - val_loss: 0.4270 - val_accuracy: 0.9402
Epoch 2/20
180/180 [==============================] - 2s 13ms/step - loss: 0.2699 - accuracy: 0.9131 - val_loss: 0.1531 - val_accuracy: 0.9537
Epoch 3/20
180/180 [==============================] - 4s 21ms/step - loss: 0.1774 - accuracy: 0.9326 - val_loss: 0.1095 - val_accuracy: 0.9636
Epoch 4/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1550 - accuracy: 0.9412 - val_loss: 0.0949 - val_accuracy: 0.9679
Epoch 5/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1379 - accuracy: 0.9528 - val_loss: 0.0887 - val_accuracy: 0.9708
Epoch 6/20
180/180 [==============================] - 2s 12ms/step - loss: 0.1324 - accuracy: 0.9546 - val_loss: 0.0859 - val_accuracy: 0.9719
Epoch 7/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1266 - accuracy: 0.9579 - val_loss: 0.0834 - val_accuracy: 0.9710
Epoch 8/20
180/180 [==============================] - 3s 18ms/step - loss: 0.1204 - accuracy: 0.9591 - val_loss: 0.0818 - val_accuracy: 0.9719
Epoch 9/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1146 - accuracy: 0.9609 - val_loss: 0.0806 - val_accuracy: 0.9726
Epoch 10/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1164 - accuracy: 0.9596 - val_loss: 0.0798 - val_accuracy: 0.9717
Epoch 11/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1122 - accuracy: 0.9621 - val_loss: 0.0809 - val_accuracy: 0.9721
Epoch 12/20
180/180 [==============================] - 3s 14ms/step - loss: 0.1138 - accuracy: 0.9610 - val_loss: 0.0811 - val_accuracy: 0.9724
Epoch 13/20
180/180 [==============================] - 3s 17ms/step - loss: 0.1057 - accuracy: 0.9624 - val_loss: 0.0789 - val_accuracy: 0.9726
Epoch 14/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1091 - accuracy: 0.9627 - val_loss: 0.0803 - val_accuracy: 0.9730
Epoch 15/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1052 - accuracy: 0.9648 - val_loss: 0.0797 - val_accuracy: 0.9744
Epoch 16/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1096 - accuracy: 0.9633 - val_loss: 0.0804 - val_accuracy: 0.9746
Epoch 17/20
180/180 [==============================] - 3s 15ms/step - loss: 0.1060 - accuracy: 0.9658 - val_loss: 0.0805 - val_accuracy: 0.9744
Epoch 18/20
180/180 [==============================] - 3s 16ms/step - loss: 0.1077 - accuracy: 0.9651 - val_loss: 0.0801 - val_accuracy: 0.9733
Epoch 19/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1060 - accuracy: 0.9639 - val_loss: 0.0803 - val_accuracy: 0.9744
Epoch 20/20
180/180 [==============================] - 2s 13ms/step - loss: 0.1045 - accuracy: 0.9657 - val_loss: 0.0791 - val_accuracy: 0.9739
epochs = range(1, len(history.history['val_accuracy']) + 1)
plt.plot(epochs,history.history['val_accuracy'], label='Validation accuracy')
plt.plot(epochs,history.history['accuracy'], label='Training accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title("Title and Text Model")
plt.legend()
plt.show()

The validation accuracy of the Title and Text model appears to stabilize around 97%. This is significantly higher than the base rate of 52% and outperforms both the Title-Only and Text-Only models. Additionally, the model does not appear to be overfit, as the training accuracy never significantly surpasses the validation accuracy.

Model Evaluation

Since the Title and Text model performed best, we will use it for our evaluation. We will fetch and process the test data in the same way we handled the training data.

test_url = "https://github.com/PhilChodrow/PIC16b/blob/master/datasets/fake_news_test.csv?raw=true"

test_data = pd.read_csv(test_url)

tf_test = make_dataset(test_data)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Now that we have this data, we will use the evaluate() function to see how the model performs on unseen test data.

loss, acc = model.evaluate(tf_test.batch(32))
print('Test accuracy:', acc)
702/702 [==============================] - 4s 5ms/step - loss: 0.0927 - accuracy: 0.9708
Test accuracy: 0.970778226852417

The test accuracy is 97%, which is similar to the validation accuracy that we saw when fitting the model. This further confirms that the model is not overfit and performs fairly well in classifying articles as fake or real.
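As a final illustration, here is how the trained model could be used to score a single new article. This is a hedged sketch: the headline and body are made up, stopwords should in practice be stripped first (as in make_dataset), and index 1 of the output corresponds to the fake class because fake articles are labeled 1 in the training data.

sample = {
    "title": tf.constant([["scientists discover miracle cure hidden by government"]]),
    "text": tf.constant([["a viral post claims a common household item cures all disease"]]),
}
logits = model(sample)                  # shape (1, 2): raw class logits
probs = tf.nn.softmax(logits, axis=1)   # convert logits to probabilities
print("P(fake) =", float(probs[0, 1]))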

Visualizing Embeddings

Now that we have a working model, we can take a look into the embedding layer to see how the model tried to group together words. Visualizing word embeddings can provide insights into how words are related to each other in a high-dimensional space.

Below, we will get the words and weights from the embedding layer.

vocab = text_vectorize_layer.get_vocabulary() # keeps track of mapping from word to integer

weights = model.get_layer("embedding").get_weights()[0]
weights.shape # 2000 vocabulary words in a 3-dimensional embedding space
(2000, 3)

Now, we will use PCA to reduce the dimensionality so that we can view the embeddings in a 2-D space. Once we have created the dimensionality reduction, we can plot the embeddings using plotly.

from sklearn.decomposition import PCA

# apply PCA to reduce dimensions
pca = PCA(n_components=2)
weights = pca.fit_transform(weights)

embedding_df = pd.DataFrame({
    'word': vocab,
    'x0': weights[:, 0],
    'x1': weights[:, 1]
})

# plot word embeddings
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "png"
fig = px.scatter(embedding_df,
                 x='x0',
                 y='x1',
                 size=[2]*len(embedding_df),
                 #size_max = 2,
                 hover_name = 'word' ,
                 title = "Word Embeddings"
                 )
fig.show(renderer = "notebook")

This graph shows the words that the model learned to associate with one another, which helped it make its predictions.

For example, the words “candidate,” “immigration,” and “president” are close to each other on the word embedding plot, suggesting that these words are frequently used in the context of fake news articles. The word “candidate” could be associated with fake news if it refers to a political candidate who is being falsely accused of something or if a fake news article is spreading false information about a candidate. The word “immigration” could come from an article spreading false information about immigration policies or immigrants. Lastly, the word “president” could come from text spreading false information about the actions or statements of the president.

We can also see another region towards the bottom of the plot that talks about foreign politics with words such as “turkish,” “bloc,” and “vladmir.” Given this, it is possible that some of these words are more likely to be seen in fake news articles since these are often produced by foreign countries seeking to influence opinions on global politics.
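We can also probe these neighborhoods programmatically by listing a word’s nearest neighbors in the PCA-reduced space. Below is a minimal sketch that reuses embedding_df from above (nearest_words is a hypothetical helper, and the exact neighbors will vary from run to run):

import numpy as np

def nearest_words(word, k=5):
    # rank all vocabulary words by Euclidean distance in the 2-D PCA space
    coords = embedding_df[['x0', 'x1']].to_numpy()
    idx = embedding_df.index[embedding_df['word'] == word][0]
    dists = np.linalg.norm(coords - coords[idx], axis=1)
    # skip position 0, which is the query word itself
    return embedding_df['word'].iloc[np.argsort(dists)[1:k + 1]].tolist()

print(nearest_words('president'))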